Please take a look at the Ames Housing data.
library(AmesHousing)
?ames_raw
The goal of this exercise is to predict the price of the house.
Here is a histogram of the sale price, with a red line showing the mean.
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+
geom_vline(xintercept = mean(ames_raw$SalePrice),col="red")
summary(ames_raw$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
Initial linear model without any predictors (intercept only)
lmfit_null<- lm(SalePrice~1,ames_raw)
summary(lmfit_null)
##
## Call:
## lm(formula = SalePrice ~ 1, data = ames_raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168007 -51296 -20796 32704 574204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180796 1476 122.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79890 on 2929 degrees of freedom
How good is this result? Let’s look at RMSE.
sqrt(mean(residuals(lmfit_null)^2))
## [1] 79873.06
Since the price is right-skewed, let's log-transform the outcome.
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = exp(mean(log(ames_raw$SalePrice))),col="red")+scale_x_log10()
Fitting the same model on the log transformed outcome
lmfit_null_log<- lm(log(SalePrice)~1,ames_raw)
summary(lmfit_null_log)
##
## Call:
## lm(formula = log(SalePrice) ~ 1, data = ames_raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56463 -0.24953 -0.03804 0.25042 1.51350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.02097 0.00753 1596 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4076 on 2929 degrees of freedom
The RMSE on the original dollar scale is
sqrt(mean((ames_raw$SalePrice-exp(predict(lmfit_null_log)))^2))
## [1] 81195.11
Notice that the RMSE is actually bigger with the log-transformed model. So should we not transform? What do we gain from the transformation?
Here is the prediction uncertainty overlaid on the histogram.
intpred=predict(lmfit_null,interval="prediction")
## Warning in predict.lm(lmfit_null, interval = "prediction"): predictions on current data refer to _future_ responses
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = mean(ames_raw$SalePrice),col="red")+geom_vline(xintercept = intpred[1,2],col="red",lty=2)+geom_vline(xintercept = intpred[1,3],col="red",lty=2)
intpredlog=predict(lmfit_null_log,interval="prediction")
## Warning in predict.lm(lmfit_null_log, interval = "prediction"): predictions on current data refer to _future_ responses
ggplot(ames_raw)+geom_histogram(fill="skyblue")+aes(x=SalePrice)+geom_vline(xintercept = exp(mean(log(ames_raw$SalePrice))),col="red")+scale_x_log10()+
geom_vline(xintercept = exp(intpredlog[1,2]),col="red",lty=2)+
geom_vline(xintercept = exp(intpredlog[1,3]),col="red",lty=2)
The log model seems to have a better uncertainty estimate. What good does that do?
Let’s say the model is for an algorithm that buys houses. If the bid is more than the true price, the company buys; if the bid is lower, the company fails to buy.
- If you bought for more than the true value, you have overpaid by the difference.
- If you bid less and lost, you lost a profit of 10% of the house price.
Based on such a loss function, what is our overall loss if we base our decisions on this model?
allres<-residuals(lmfit_null)
abs(sum(allres[allres<0]))+sum(0.1*(coef(lmfit_null)+allres[allres>0]))
## [1] 114215680
allreslog<-ames_raw$SalePrice-exp(predict(lmfit_null_log))
abs(sum(allreslog[allreslog<0]))+sum(0.1*(exp(coef(lmfit_null_log))+allreslog[allreslog>0]))
## [1] 94211655
As you can see, the better-calibrated model performs better under this more realistic loss.
Gr Liv Area
We add a predictor, Gr Liv Area.
p=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)
library(ggExtra)   # for ggMarginal()
p3 <- ggMarginal(p, margins = 'both', fill="skyblue", size=4,type="histogram")
p3
Using Gr Liv Area as predictor
lmfit_liv_area_0<- lm(SalePrice~`Gr Liv Area`,ames_raw)
lmfit_liv_area_1<- lm(log(SalePrice)~`Gr Liv Area`,ames_raw)
lmfit_liv_area_2<- lm(log(SalePrice)~log(`Gr Liv Area`),ames_raw)
gridExtra::grid.arrange(
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE),
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10(),
ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10(),
qplot(predict(lmfit_liv_area_0),residuals(lmfit_liv_area_0))+geom_smooth(method=MASS::rlm)+geom_hline(yintercept = 0,lty=2),
qplot(predict(lmfit_liv_area_1),residuals(lmfit_liv_area_1))+geom_smooth(method=MASS::rlm)+geom_hline(yintercept = 0,lty=2),
qplot(predict(lmfit_liv_area_2),residuals(lmfit_liv_area_2))+geom_smooth(method=MASS::rlm)+geom_hline(yintercept = 0,lty=2),
ncol=3
)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Because of the skewness, it's better to take the log of both x and y.
p=ggplot(ames_raw)+geom_point()+aes(x=`Gr Liv Area`,y=SalePrice)+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()
p3 <- ggMarginal(p, margins = 'both', fill="skyblue", size=4,type="histogram")
p3
lm_mod_1 <- lm(log(SalePrice)~log(`Gr Liv Area`),ames_raw)
summary(lm_mod_1)
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Gr Liv Area`), data = ames_raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0778 -0.1465 0.0264 0.1740 0.8602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.43019 0.11644 46.63 <2e-16 ***
## log(`Gr Liv Area`) 0.90781 0.01602 56.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2816 on 2928 degrees of freedom
## Multiple R-squared: 0.523, Adjusted R-squared: 0.5228
## F-statistic: 3210 on 1 and 2928 DF, p-value: < 2.2e-16
However, the residuals still show heterogeneous spread.
plot(predict(lm_mod_1),resid(lm_mod_1),main="residual plot");abline(h=0,lty=2,col="grey")
# overlay quantile regression lines (10th and 90th percentiles) of the residuals
library(quantreg)
qu<-rq(resid(lm_mod_1)~predict(lm_mod_1),tau = c(0.1, 0.9))
colors <- c("#ffe6e6", "#ffcccc", "#ff9999", "#ff6666", "#ff3333",
            "#ff0000", "#cc0000", "#b30000", "#800000", "#4d0000", "#000000")
for (j in 1:ncol(qu$coefficients)) {
  abline(coef(qu)[, j], col = colors[j])
}
Did we reduce the residual variability?
logSalePrice<-log(ames_raw$SalePrice)
clogSalePricec <-logSalePrice-mean(logSalePrice)
rlogSalePricec<-resid(lm_mod_1)
labs<-c("mean","regression")
names(labs)<-c("clogSalePricec","rlogSalePricec")
library(reshape2)   # for melt()
ggplot(melt(data.frame(clogSalePricec,rlogSalePricec)))+
geom_histogram(fill="skyblue")+
aes(x=value,color=variable)+geom_vline(xintercept = 0,col="red")+
facet_grid(variable~.,labeller = labeller(variable = labs))
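To quantify the reduction, here is a quick numerical check (a sketch using the objects defined above): compare the variance of the centered log prices with the variance of the regression residuals.
# variance of log(SalePrice) around its mean vs. around the regression fit
c(around_mean = var(clogSalePricec), around_fit = var(rlogSalePricec))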
anova(lm_mod_1)
## Analysis of Variance Table
##
## Response: log(SalePrice)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(`Gr Liv Area`) 1 254.47 254.470 3210 < 2.2e-16 ***
## Residuals 2928 232.12 0.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at correlations among the candidate predictors and the sale price.
#`Lot Area`,`Year Built`,`Year Remod/Add`,`Total Bsmt SF`,`1st Flr SF`,`2nd Flr SF`,`Low Qual Fin SF`,`Gr Liv Area`,`Full Bath`,`Half Bath`,`Bsmt Full Bath`,`Bsmt Half Bath`,`TotRmsAbvGrd`,`Garage Area`,`Wood Deck SF`,`Open Porch SF`,`Pool Area`,`Sale Type`
M <-cor(ames_raw[,c("Lot Area","Year Built","Year Remod/Add","Total Bsmt SF","1st Flr SF","2nd Flr SF","Low Qual Fin SF","Gr Liv Area","Full Bath","Half Bath","Bsmt Full Bath","Bsmt Half Bath","TotRms AbvGrd","Garage Area","Wood Deck SF","Open Porch SF","Pool Area","SalePrice")],use="pairwise.complete.obs")
library(corrplot)      # for corrplot()
library(RColorBrewer)  # for brewer.pal()
corrplot(M, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))
mames<-melt(ames_raw[,c("Lot Area","Year Built","Year Remod/Add","Total Bsmt SF","1st Flr SF","2nd Flr SF","Low Qual Fin SF","Gr Liv Area","Full Bath","Half Bath","Bsmt Full Bath","Bsmt Half Bath","TotRms AbvGrd","Garage Area","Wood Deck SF","Open Porch SF","Pool Area","SalePrice")],id.vars = "SalePrice")
ggplot(mames)+geom_point()+aes(x=value,y=SalePrice)+facet_wrap(variable~.,scales = "free")
## Warning: Removed 6 rows containing missing values (`geom_point()`).
Lot Area
When looking at lot area, it's no surprise that it has some relationship with the price. But the relationship is not a clearly linear one. Why?
ggplot(ames_raw)+geom_point()+aes(x=`Lot Area`,y=SalePrice)+xlab("Lot size in square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()#+facet_grid(~`Bedroom AbvGr`)
If you look at this by neighborhood, it becomes obvious that in some places size matters more than in others.
ggplot(ames_raw)+geom_point(alpha=0.3)+aes(x=`Lot Area`,y=SalePrice,color=factor(Neighborhood))+xlab("Lot size in square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=FALSE)+scale_y_log10()+scale_x_log10()+ theme(legend.position = "none")+facet_wrap(~Neighborhood)
To make the project more realistic, I will split the data into before 2008 and after. The data up to 2008 will be the training data, and the data after will be the testing data.
ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]
If you look at the time trend, it seems the price is fairly stable over the years.
library(lubridate)   # for ym()
ames_raw$saledt <- ym(paste0(ames_raw$`Yr Sold`,"-",ames_raw$`Mo Sold`))
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
p=ggplot(ames_raw)+geom_point()+aes(x=saledt,y=SalePrice)+xlab("Sale Date")+ylab("Sale Price")+scale_y_log10()+geom_smooth(formula = y ~ x, method = "rolling_median",se=FALSE)+geom_vline(xintercept=ym("2008-01"),lty=2,col="orange")
p3 <- ggMarginal(p, margins = 'y', fill="skyblue", size=4,type="histogram")
p3
Fitting the null model on the training data
lmfit_null_2008 <- lm(SalePrice~1,ames_raw_2008)
lmfit_null_log_2008 <- lm(log(SalePrice)~1,ames_raw_2008)
Comparing the RMSE
sqrt(mean((ames_raw_2009$SalePrice-predict(lmfit_null,newdata=ames_raw_2009))^2))
## [1] 77587.03
sqrt(mean((ames_raw_2009$SalePrice-exp(predict(lmfit_null_log,newdata=ames_raw_2009)))^2))
## [1] 78531.29
Comparing the business loss
allres_2009=ames_raw_2009$SalePrice-predict(lmfit_null,newdata=ames_raw_2009)
abs(sum(allres_2009[allres_2009<0]))+sum(0.1*(coef(lmfit_null_2008)+allres_2009[allres_2009>0]))
## [1] 63806382
allreslog_2009<-(ames_raw_2009$SalePrice-exp(predict(lmfit_null_log,newdata=ames_raw_2009)))
abs(sum(allreslog_2009[allreslog_2009<0]))+sum(0.1*(exp(coef(lmfit_null_log_2008))+allreslog_2009[allreslog_2009>0]))
## [1] 52684502
Use the ames_raw data up to 2008 to predict the housing prices for the later years.
ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]
Use the following loss function calculator.
calc_loss<-function(prediction,actual){
  difpred <- actual-prediction
  difpred <- na.omit(difpred)
  RMSE <- sqrt(mean(difpred^2))
  operation_loss <- abs(sum(difpred[difpred<0]))+sum(0.1*actual[difpred>0])
  return(list(RMSE, operation_loss))
}
Here are a few rules:
- You are not allowed to use the test data.
- You cannot use automatic variable selection.
- You need to explain why you added each variable.
lmfit_2008<- lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008) # use ames_raw_2008
lmfit_2008
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` +
## `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) +
## `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` +
## `Open Porch SF`, data = ames_raw_2008)
##
## Coefficients:
## (Intercept) log(`Lot Area`) `Year Built` `Year Remod/Add`
## -5.573e+00 8.383e-02 2.898e-03 3.078e-03
## `Total Bsmt SF` `1st Flr SF` log(`Gr Liv Area`) `Bsmt Full Bath`
## 1.588e-04 -4.433e-05 6.828e-01 6.376e-02
## `TotRms AbvGrd` `Garage Area` `Wood Deck SF` `Open Porch SF`
## -3.040e-02 2.378e-04 1.167e-04 -1.455e-04
When you decide on your model, use the following to come up with your test loss.
pred_2009<-exp(predict(lmfit_2008,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
Try to answer the following additional questions.
Your code:
summary(lmfit_2008)
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` +
## `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) +
## `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` +
## `Open Porch SF`, data = ames_raw_2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48256 -0.08492 0.00058 0.09723 0.68601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.573e+00 5.456e-01 -10.214 < 2e-16 ***
## log(`Lot Area`) 8.383e-02 1.094e-02 7.661 3.57e-14 ***
## `Year Built` 2.898e-03 2.298e-04 12.614 < 2e-16 ***
## `Year Remod/Add` 3.078e-03 3.153e-04 9.762 < 2e-16 ***
## `Total Bsmt SF` 1.588e-04 1.967e-05 8.076 1.50e-15 ***
## `1st Flr SF` -4.433e-05 2.275e-05 -1.949 0.05150 .
## log(`Gr Liv Area`) 6.828e-01 2.934e-02 23.276 < 2e-16 ***
## `Bsmt Full Bath` 6.376e-02 1.007e-02 6.331 3.35e-10 ***
## `TotRms AbvGrd` -3.040e-02 5.274e-03 -5.764 1.02e-08 ***
## `Garage Area` 2.378e-04 2.996e-05 7.938 4.41e-15 ***
## `Wood Deck SF` 1.167e-04 3.937e-05 2.964 0.00309 **
## `Open Porch SF` -1.455e-04 7.253e-05 -2.005 0.04513 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1721 on 1306 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8178, Adjusted R-squared: 0.8163
## F-statistic: 532.9 on 11 and 1306 DF, p-value: < 2.2e-16
#
Your answer:
It is a reasonable model: all the predictors are statistically significant, and the adjusted R-squared is 0.8163, which is fairly high.
Your code:
#
lmfit_2008all = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `2nd Flr SF`+ `Low Qual Fin SF` + log(`Gr Liv Area`) + `Full Bath` + `Half Bath`+ `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` + `Pool Area`, data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008all,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43840.48
##
## [[2]]
## [1] 32470209
summary(lmfit_2008all)
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` +
## `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `2nd Flr SF` +
## `Low Qual Fin SF` + log(`Gr Liv Area`) + `Full Bath` + `Half Bath` +
## `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` +
## `Wood Deck SF` + `Open Porch SF` + `Pool Area`, data = ames_raw_2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52376 -0.08264 0.00067 0.09629 0.76399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.809e+00 7.184e-01 -8.086 1.40e-15 ***
## log(`Lot Area`) 8.272e-02 1.099e-02 7.529 9.54e-14 ***
## `Year Built` 2.976e-03 2.644e-04 11.259 < 2e-16 ***
## `Year Remod/Add` 3.139e-03 3.199e-04 9.812 < 2e-16 ***
## `Total Bsmt SF` 1.572e-04 2.000e-05 7.862 7.89e-15 ***
## `1st Flr SF` -3.392e-05 4.335e-05 -0.783 0.43405
## `2nd Flr SF` 1.850e-05 3.895e-05 0.475 0.63486
## `Low Qual Fin SF` -1.491e-04 1.158e-04 -1.288 0.19810
## log(`Gr Liv Area`) 6.794e-01 6.135e-02 11.075 < 2e-16 ***
## `Full Bath` -1.313e-02 1.427e-02 -0.920 0.35760
## `Half Bath` -7.569e-03 1.438e-02 -0.526 0.59863
## `Bsmt Full Bath` 6.460e-02 1.051e-02 6.144 1.07e-09 ***
## `Bsmt Half Bath` 2.497e-02 1.912e-02 1.306 0.19174
## `TotRms AbvGrd` -3.000e-02 5.418e-03 -5.538 3.70e-08 ***
## `Garage Area` 2.371e-04 3.005e-05 7.891 6.34e-15 ***
## `Wood Deck SF` 1.080e-04 3.997e-05 2.703 0.00695 **
## `Open Porch SF` -1.416e-04 7.339e-05 -1.930 0.05381 .
## `Pool Area` 1.631e-05 1.019e-04 0.160 0.87290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1722 on 1300 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8185, Adjusted R-squared: 0.8161
## F-statistic: 344.8 on 17 and 1300 DF, p-value: < 2.2e-16
Your answer:
lmfit_2008some = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008some,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43015.08
##
## [[2]]
## [1] 32494178
summary(lmfit_2008some)
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` +
## `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` +
## log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` +
## `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF`,
## data = ames_raw_2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50374 -0.08347 -0.00057 0.09616 0.76774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.567e+00 5.468e-01 -10.181 < 2e-16 ***
## log(`Lot Area`) 8.280e-02 1.095e-02 7.561 7.51e-14 ***
## `Year Built` 2.861e-03 2.314e-04 12.368 < 2e-16 ***
## `Year Remod/Add` 3.101e-03 3.153e-04 9.835 < 2e-16 ***
## `Total Bsmt SF` 1.577e-04 1.972e-05 7.998 2.76e-15 ***
## `1st Flr SF` -4.636e-05 2.276e-05 -2.037 0.04185 *
## `Low Qual Fin SF` -1.558e-04 1.121e-04 -1.390 0.16481
## log(`Gr Liv Area`) 6.868e-01 2.938e-02 23.372 < 2e-16 ***
## `Bsmt Full Bath` 6.686e-02 1.025e-02 6.520 1.00e-10 ***
## `Bsmt Half Bath` 2.653e-02 1.894e-02 1.401 0.16158
## `TotRms AbvGrd` -2.979e-02 5.283e-03 -5.640 2.09e-08 ***
## `Garage Area` 2.374e-04 3.001e-05 7.913 5.35e-15 ***
## `Wood Deck SF` 1.102e-04 3.956e-05 2.786 0.00541 **
## `Open Porch SF` -1.418e-04 7.251e-05 -1.955 0.05079 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.172 on 1304 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8183, Adjusted R-squared: 0.8165
## F-statistic: 451.9 on 13 and 1304 DF, p-value: < 2.2e-16
I tried deleting some predictors with high p-values from the lmfit_2008all model (the model with all predictors); however, the operational loss is slightly higher than before, even though the RMSE is lower. So I think the lmfit_2008all model predicts the price best under the business loss, while lmfit_2008some, obtained after deleting several predictors, has the highest adjusted R-squared.
Your code:
lmfit_2008some = lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + `Low Qual Fin SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `Bsmt Half Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008)
pred_2009<-exp(predict(lmfit_2008some,newdata=ames_raw_2009))
calc_loss(pred_2009,ames_raw_2009$SalePrice)
## [[1]]
## [1] 43015.08
##
## [[2]]
## [1] 32494178
Your answer:
All of the remaining predictors are statistically significant. I was not sure how to add an interaction term.
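For reference, interactions can be added to an lm() formula with * (main effects plus the interaction) or : (the interaction alone). Below is a minimal sketch on the training data; the particular interaction between Gr Liv Area and Year Built is only an illustration, not a recommended model.
# `*` expands to the main effects plus their interaction
lmfit_2008_int <- lm(log(SalePrice) ~ log(`Gr Liv Area`) * `Year Built`, data = ames_raw_2008)
summary(lmfit_2008_int)
# out-of-sample loss, using the same helper as above
calc_loss(exp(predict(lmfit_2008_int, newdata = ames_raw_2009)), ames_raw_2009$SalePrice)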
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Your code:
#
#
Your answer:
Please write your answer in full sentences.
Auto data set.
Your code:
Auto = read.csv('Auto.csv', na.strings = "?")   # "?" marks missing horsepower values
Auto = na.omit(Auto)
lmfitAuto = lm(mpg ~ horsepower, data = Auto)
summary(lmfitAuto)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Your answer:
Horsepower is statistically significant. For each one-unit increase in horsepower, mpg is expected to decrease by about 0.158.
Comment on the output. For example: i. Is there a relationship between the predictor and the response?
predicted_mpg <- predict(lmfitAuto, newdata = data.frame(horsepower = 98), interval = "confidence")
predicted_mpg
## fit lwr upr
## 1 24.46708 23.97308 24.96108
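For comparison, the corresponding 95% prediction interval at horsepower = 98 uses the same predict() call with interval = "prediction"; it should be noticeably wider than the confidence interval because it also accounts for the residual variance (a sketch):
predict(lmfitAuto, newdata = data.frame(horsepower = 98), interval = "prediction")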
Use the abline() function to display the least squares regression line.
Your code:
ggplot(Auto)+
geom_point()+
aes(x = horsepower, y = mpg)+
xlab("horsepower")+
ylab("mpg")+
geom_abline(slope = coef(lmfitAuto)[2], intercept = coef(lmfitAuto)[1], col = "blue", linewidth = 1)+
theme_bw()
#
Your answer:
Please write your answer in full sentences.
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
Your code:
plot(lmfitAuto)
Your answer:
These plots show that the residuals are not normally distributed, which means that the model could be improved.
Auto data set.
Your code:
pairs(subset(Auto, select = -name), pch = 1)
Your answer:
From this graph, there are obvious correlations between several of the predictors.
Your code:
cor(subset(Auto, select = -name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
#
Your answer:
There is high correlation between displacement, cylinders, horsepower, and weight.
Your code:
lmfitAutoall = lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = Auto)
summary(lmfitAutoall)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Your answer:
Displacement, weight, year, and origin are significant, while the others are not. There is a relationship between the predictors and the response, as the R-squared is high.
Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response?
Your code:
plot(lmfitAutoall)
#
Your answer:
Observations 325, 324, and 321 are unusually large outliers, and observation 14 has high leverage.
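One way to back up these observation numbers (a sketch using base R diagnostics) is to look at the studentized residuals and leverage values directly:
# largest studentized residuals (candidate outliers)
head(sort(abs(rstudent(lmfitAutoall)), decreasing = TRUE))
# largest hat values (high-leverage points), compared to the average leverage
head(sort(hatvalues(lmfitAutoall), decreasing = TRUE))
mean(hatvalues(lmfitAutoall))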
Your code:
# Fit a linear regression model with interactions between two predictors
lmAuto_i <- lm(mpg ~ cylinders * displacement +
cylinders * horsepower +
cylinders * weight +
cylinders * acceleration +
cylinders * year +
cylinders * origin +
displacement * horsepower +
displacement * weight +
displacement * acceleration +
displacement * year +
displacement * origin +
horsepower * weight +
horsepower * acceleration +
horsepower * year +
horsepower * origin +
weight * acceleration +
weight * year +
weight * origin +
acceleration * year +
acceleration * origin +
year * origin,
data = Auto)
summary(lmAuto_i)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + cylinders * horsepower +
## cylinders * weight + cylinders * acceleration + cylinders *
## year + cylinders * origin + displacement * horsepower + displacement *
## weight + displacement * acceleration + displacement * year +
## displacement * origin + horsepower * weight + horsepower *
## acceleration + horsepower * year + horsepower * origin +
## weight * acceleration + weight * year + weight * origin +
## acceleration * year + acceleration * origin + year * origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Your answer:
The interactions displacement:year, acceleration:year, and acceleration:origin are statistically significant.
I added all pairwise interactions and then picked out the interactions with low p-values.
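One way to follow up (a sketch, not a definitive model) is to refit with the main effects plus only the interactions flagged above and compare the fits:
lmAuto_i_small <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin +
                       displacement:year + acceleration:year + acceleration:origin, data = Auto)
summary(lmAuto_i_small)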
This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
Your code:
library(ISLR2)
lmCarseats = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lmCarseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Your answer:
It is not a good model, as the R-squared is quite low, even though the overall F-test is highly significant.
Your answer:
We fail to reject the null hypothesis that Urban has no effect on Sales.
When US is Yes, Sales are expected to be about 1.2 (in thousands of units) higher than when US is No, holding the other predictors fixed.
When Price increases by 1, Sales are expected to decrease by about 0.054, holding the other predictors fixed.
Your answer:
$$ \text{Sales} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{UrbanYes} + 1.200573 \times \text{USYes} + \varepsilon $$
Your answer:
Price and US.
Your code:
lmCarseats_s = lm(Sales ~ Price + US, data = Carseats)
summary(lmCarseats_s)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Your answer:
The model in (e) is slightly better than the model in (a), but since the adjusted R-squared of both is low, neither model fits particularly well.
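Since the model in (e) is nested inside the model in (a), an F-test comparing the two gives a more direct check (a sketch):
anova(lmCarseats_s, lmCarseats)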
Your code:
conf_intervals <- confint(lmCarseats_s)
conf_intervals
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Your answer:
Please write your answer in full sentences.
Your answer:
plot(lmCarseats_s)
Yes. Observations 69, 37, and 51 appear to be outliers, and observations 26, 50, and 368 are high-leverage points.
You can do KNN regression using the FNN package. Read more about it here: https://daviddalpiaz.github.io/r4sl/knn-reg.html
library(FNN)
knn.reg(train = ?, test = ?, y = ?, k = ?)
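Here is a minimal sketch of how those arguments could be filled in, assuming the ames_raw_2008/ames_raw_2009 split defined above and two numeric predictors; the variable choice and k = 10 are only illustrations, and the predictors are scaled with the training means and standard deviations because KNN is distance-based.
xvars <- c("Gr Liv Area", "Lot Area")
x_train <- scale(as.matrix(ames_raw_2008[, xvars]))
x_test  <- scale(as.matrix(ames_raw_2009[, xvars]),
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
knn_fit <- knn.reg(train = x_train, test = x_test, y = ames_raw_2008$SalePrice, k = 10)
# test RMSE of the KNN predictions
sqrt(mean((ames_raw_2009$SalePrice - knn_fit$pred)^2))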
There are a few platforms in R for predictive modeling. These platforms are wrappers around other packages that make it easy to do routine tasks.
# split the data
index <- sample(1:nrow(ames_raw), 0.7*nrow(ames_raw))
vars <- c("SalePrice","Lot Area","Gr Liv Area","Full Bath")
train <- ames_raw[ index, vars]
test <- ames_raw[-index, vars]
colnames(train) <- make.names(colnames(train))
colnames(test) <- make.names(colnames(test))
# mlr3 TaskRegr
# note: the outcome is log-transformed in the training data only, so remember to
# exp() any predictions before comparing them with test$SalePrice
train$SalePrice <- log(train$SalePrice)
mlr3 is a useful ML platform in R. You can learn about it here: https://mlr3book.mlr-org.com There are both mlr and mlr3; it's better to use mlr3 since mlr is no longer actively developed.
# load packages and data
library(mlr3)
library(mlr3learners)
# fit a model
task <- as_task_regr(train, target ="SalePrice",id = "ames_raw")
# TaskRegr$new(id = "ames_raw", backend = train, target ="SalePrice")
learner <- lrn("regr.lm", predict_type = "response")
learner$train(task)
prediction=predict(learner,newdata=test)
Tidymodels is another one of the tidy family that is similar to mlr3. https://www.tidymodels.org/
# load packages and data
library(tidymodels)
library(dotwhisker)
# fit a model
rec <- recipe(SalePrice ~ ., data = train)
clf <- linear_reg()
wflow <- workflow() %>%
add_recipe(rec) %>%
add_model(clf)
lm_fit <- wflow %>% fit(data = train)
prediction=predict(lm_fit,new_data=test)
tidy(lm_fit) %>%
dwplot(dot_args = list(size = 2, color = "black"),
whisker_args = list(color = "black"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2))
# load packages and data
library(caret)
# fit a model
ctrl <- trainControl(method = "none")
lm_model <- train(SalePrice ~ ., data = train, method = "lm", trControl = ctrl)
prediction=predict(lm_model,newdata = test)
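Whichever wrapper you use, remember that the model was fit on log(SalePrice) while test$SalePrice is still on the dollar scale, so predictions need to be exponentiated before computing a test error. A sketch using the caret fit above:
# back-transform the predictions and compute the test RMSE on the dollar scale
sqrt(mean((test$SalePrice - exp(prediction))^2, na.rm = TRUE))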
H2O is a cross-platform library that is popular in the predictive modeling space. It works well out of the box and can be called from many platforms independent of the language used. Making it work in R can sometimes be a headache, so try using H2O, but if you cannot make it work, I recommend you leave this section alone.
If you are on a Mac, you will need to install Java (http://www.java.com).
install.packages("h2o")
# load packages and data
library(h2o)
packageVersion("h2o")
To use H2O you need to instantiate it.
# nthreads specifies number of threads. -1 means use all the CPU cores.
# max_mem_size specifies the maximum amount of RAM to use.
localH2O <- h2o.init(nthreads = -1, max_mem_size="4g")
You can access H2O instance using the web UI FLOW by typing http://localhost:54321 in your browser.
Since H2O is not in R, you need to tell it to use your data.
train_hf <- as.h2o(train)
test_hf <- as.h2o(test)
gaussian.fit = h2o.glm(y = "SalePrice",                               # response variable
                       x = c("Lot.Area","Gr.Liv.Area","Full.Bath"),   # predictor variables
                       training_frame = train_hf,                     # data
                       family = "gaussian", lambda = 0)               # distribution of y and penalty parameter lambda
gaussian.fit
prediction=predict(gaussian.fit,newdata = test_hf)
h2o.exportFile(prediction, "/tmp/pred.csv", force = TRUE) #export prediction result as a file
# save the model
model_path <- h2o.saveModel(object=gaussian.fit, path=getwd(), force=TRUE)
print(model_path)
# load the model
saved_model <- h2o.loadModel(model_path) #extract the saved model
saved_model
h2o.shutdown(prompt =F)
The idea of regression started with Euler (1749) when he was studying the motion of the planets. He wanted to predict the location of Jupiter. Euler had 75 observations for a linear model with seven unknown constants; equivalently, he had 75 equations and seven unknowns.
Let’s translate this into a more familiar notation that we are used to. We will let \[\begin{eqnarray*} y_i=\varphi_i\verb|, |\boldsymbol{\beta}=\left(\begin{array}{c} \beta_1\\ \beta_2\\ \vdots\\ \beta_p \end{array}\right) \verb|, and | \mathbf{X}_i=\left(\begin{array}{c} x_{i1} \\ x_{i2} \\ \vdots\\ x_{ip} \\ \end{array}\right) \end{eqnarray*}\]
Then, we can write the equation as \[\begin{eqnarray*} y_i=\mathbf{X}_i^T\boldsymbol{\beta}=x_{i1}\beta_1+x_{i2}\beta_2+\cdots+x_{ip}\beta_p \verb| for |i=1,\cdots,n=75 \end{eqnarray*}\]
There are a few assumptions that went into the model: the observations follow \(y_i=\mathbf{X}_i^T\boldsymbol{\beta}_{true}+\epsilon_i\), where the errors \(\epsilon_i\) are independent with \(E(\epsilon_i)=0\) and \(Var(\epsilon_i)=\sigma^2\).
We know that when we have seven unknowns, we need seven (independent) equations to solve for those unknowns. Since Euler had 75 equations, he needed a way to reduce those 75 equations to seven. The simplest thing to do is randomly choose seven of the 75 equations and solve for the unknowns using only the selected equations. But Laplace came up with a better idea: combine the 75 equations so that you get seven equations as weighted linear combinations of the 75. That is, each combined equation is
\[\begin{eqnarray*} \sum^n_{i=1}w_iy_i=\sum^n_{i=1}w_i\mathbf{X}_i^T\boldsymbol{\beta} \end{eqnarray*}\]If we combine all the weights into a matrix \[\begin{eqnarray*} \mathbf{W}=\left( \begin{array}{cccc} w_{11}&w_{12}&\cdots &w_{1p} \\ \vdots&\vdots&\ddots&\vdots\\ w_{n1}&w_{n2}&\cdots &w_{np} \\ \end{array} \right) \end{eqnarray*}\]
Then 75 equations \(Y_{75\times 1}=\mathbf{X}_{75\times 7}\boldsymbol{\beta}_{7\times1}\) can be expressed as \[\begin{eqnarray*} (\mathbf{W}^TY)_{7\times 1}=(\mathbf{W}^T\mathbf{X})_{7\times 7}\boldsymbol{\beta} \rightarrow (\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^TY)_{7\times 1}=\hat{\hspace{0pt}\boldsymbol{\beta}} \end{eqnarray*}\]
This makes us ask a new question: what \(\mathbf{W}\) do we want to get the optimal estimate of \(\boldsymbol{\beta}\)?
Let us start with a case where we only have one unknown variable. we have \[\begin{eqnarray*} \mathbf{X}=\left[ \begin{array}{c} x_{1} \\ \vdots\\ x_{n} \\ \end{array} \right]\verb|, | Y=\left[ \begin{array}{c} y_{1} \\ \vdots\\ y_{n} \\ \end{array} \right]\verb|, and | \mathbf{W}=\left[ \begin{array}{c} w_{1} \\ \vdots\\ w_{n} \\ \end{array} \right] \end{eqnarray*}\]
In this case, we can combine \(n\) equations as \[\begin{eqnarray*} \sum^n_{i=1}w_iy_i=\sum^n_{i=1}w_i\mathbf{X}_i^T\boldsymbol{\beta} \end{eqnarray*}\] solving for \(\boldsymbol{\beta}\) gives us \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}&=&\frac{\sum_i^n w_iy_i}{\sum_i^n w_i x_i} \\ &=&\frac{\langle \mathbf{W}, Y\rangle}{\langle \mathbf{W}, \mathbf{X}\rangle} \end{eqnarray*}\]
What is the optimal \(\mathbf{W}\) in this case? Let’s formulate this problem in a statistical framework.
We need to make a few assumptions. As already stated, we assume the errors are independent with \(E(\epsilon_i)=0\) and \(Var(\epsilon_i)=\sigma^2\).
We will start with the estimating equation \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}&=&\frac{\langle \mathbf{W}, \mathbf{X}\boldsymbol{\beta}_{true}+\epsilon \rangle}{\langle \mathbf{W}, \mathbf{X}\rangle}\\ &=&\boldsymbol{\beta}_{true}+\frac{\langle \mathbf{W}, \epsilon \rangle}{\langle \mathbf{W}, \mathbf{X}\rangle}\\ \end{eqnarray*}\]
The Frequentist view of statistics is relevant for this case because when we fit a model, we want to use the procedure repeatedly. Thus, we can do hypothetical repeated sampling. For each repetition, we get \(\epsilon_i\)'s, from which we get \(Y\), which gives us a different \(\hat{\hspace{0pt}\boldsymbol{\beta}}\). Then \(E(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is the long-run average under repetition, and \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is the long-run variance under repetition.
From the assumptions, we know \[\begin{eqnarray*} E\left(\langle \mathbf{W}, \epsilon\rangle\right)&=&\sum^n_{i=1}E\left(w_i \epsilon_i\right)\\ &=&\sum^n_{i=1}w_i E(\epsilon_i)\\ &=&0 \end{eqnarray*}\] and \[\begin{eqnarray*} Var\left(\langle \mathbf{W}, \epsilon\rangle \right)&=&Var\left(\sum^n_{i=1}w_i \epsilon_i \right)\\ &=&\sum^n_{i=1}Var(w_i \epsilon_i)\\ &=&\sum^n_{i=1}w_i^2Var(\epsilon_i)\\ &=&\lVert w \rVert^2\sigma^2\\ \end{eqnarray*}\]
Therefore \[\begin{eqnarray*} E\left( \hat{\hspace{0pt}\boldsymbol{\beta}} \right) &=& E\left(\boldsymbol{\beta}_{true}+ \frac{\langle \mathbf{W},\ \epsilon\rangle }{\langle \mathbf{W},\ \mathbf{X}\rangle} \right)\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var\left(\hat{\hspace{0pt}\boldsymbol{\beta}}\right) &=& Var\left(\boldsymbol{\beta}_{true}+ \frac{\langle \mathbf{W},\ \epsilon\rangle }{\langle \mathbf{W},\ \mathbf{X}\rangle }\right)\\ &=&Var\left(\frac{\langle \mathbf{W},\ \epsilon\rangle }{\langle \mathbf{W},\ \mathbf{X}\rangle }\right)\\ &=&\frac{\lVert \mathbf{W} \rVert^2\sigma^2}{|\langle \mathbf{W},\ \mathbf{X} \rangle |^2}\\ &=&\frac{\lVert \mathbf{W} \rVert^2\sigma^2}{(\lVert \mathbf{W}\rVert \lVert \mathbf{X} \rVert \cos\theta)^2}\\ &=&\frac{\sigma^2}{\lVert \mathbf{X} \rVert^2 \cos^2\theta} \end{eqnarray*}\] Thus we see that in general \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\) is minimized when \(\cos^2\theta=1\), which is when \(\theta=0\), hence \(\mathbf{W}\propto \mathbf{X}\). This indicates that when \(\mathbf{W} = c\mathbf{X}\) for some constant \(c\), \(\hat{\hspace{0pt}\boldsymbol{\beta}}\) will have minimum variance.
Alternatively, we can start from the objective function \(R(\boldsymbol{\beta})=\sum^{n}_{i=1}{(y_i-x_i\boldsymbol{\beta})^2}\) and try to find the \(\boldsymbol{\beta}\) that minimizes \(R(\boldsymbol{\beta})\). \[\begin{eqnarray*} R(\boldsymbol{\beta})'&=&-2\sum^{n}_{i=1}{(y_i-x_i\boldsymbol{\beta})x_i}=0\\ &\Rightarrow&\sum^{n}_{i=1}{(x_i^2\boldsymbol{\beta})}=\sum^{n}_{i=1}{(y_ix_i)}\\ &\Rightarrow&\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}=\frac{\sum^{n}_{i=1}{(y_ix_i)}}{\sum^{n}_{i=1}{(x_i^2)}}\\ \end{eqnarray*}\] The resulting \(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}\) is the least squares estimate of \(\boldsymbol{\beta}\), which we see is a particular case of the general estimating equation with \(w_i=x_i\). Because of this result, the least squares estimate is called the best linear unbiased estimator (BLUE).
We can do the same for more general \(p\), for \[\begin{eqnarray*} \mathbf{X}=\left[ \begin{array}{cccc} x_{11}&x_{12}&\cdots &x_{1p} \\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots &x_{np} \\ \end{array} \right]\verb|, | Y=\left[ \begin{array}{c} y_{1} \\ \vdots\\ y_{n} \\ \end{array} \right]\verb|, | \epsilon=\left[ \begin{array}{c} \epsilon_{1} \\ \vdots\\ \epsilon_{n} \\ \end{array} \right] \verb|, and | \boldsymbol{\beta}=\left[ \begin{array}{c} \beta_{1} \\ \vdots\\ \beta_{p} \\ \end{array} \right] \end{eqnarray*}\]
We still have the same assumption
Again define the objective function as \(R(\boldsymbol{\beta})={\lVert Y-\mathbf{X}\boldsymbol{\beta}\rVert^2}\); the least squares estimate of \(\boldsymbol{\beta}\), \(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}\), is the one that minimizes this loss function. We can derive \(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}\) by taking the derivative of \(R(\boldsymbol{\beta})\) with respect to \(\boldsymbol{\beta}\) and setting it equal to zero. Hence
\[\begin{eqnarray*} R(\boldsymbol{\beta})'&=&-2\mathbf{X}^{T}{(Y-\mathbf{X}\boldsymbol{\beta})}\\ &\Rightarrow&-2(\mathbf{X}^{T}Y-\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta})=0\\ &\Rightarrow&\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}=\mathbf{X}^{T}Y\\ &\Rightarrow&\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y\\ \end{eqnarray*}\] So, our estimated value of \(Y\) in the space spanned by \(\mathbf{X}\) using the least squares estimate of the \(\boldsymbol{\beta}\) can be written as \[\begin{eqnarray*} \hat{\hspace{0pt}Y}=\mathbf{X}\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y\\ \end{eqnarray*}\]
Because \(\hat{\hspace{0pt}Y}\) lives in the space spanned by \(\mathbf{X}\) (since \(\mathbf{X}\boldsymbol{\beta}\) is a linear combination of the columns of \(\mathbf{X}\)), \(\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\) can be thought of as the projection matrix that projects \(Y\) onto the space spanned by \(\mathbf{X}\). This matrix is often denoted by the capital letter \(H\) and called the hat matrix, \(H=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\). The hat matrix is symmetric and idempotent, that is, \(H^T=H\) and \(H^2=H\).
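As a quick numerical sanity check (a sketch with simulated data, not part of the Ames analysis), the closed-form \((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y\) matches what lm() returns, and the hat matrix is symmetric and idempotent with trace equal to the number of columns of \(\mathbf{X}\):
set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # design matrix with an intercept column
beta_true <- c(1, 2, -0.5)
y <- as.vector(X %*% beta_true + rnorm(n))
beta_ls <- solve(t(X) %*% X, t(X) %*% y)       # normal equations
cbind(closed_form = as.vector(beta_ls), lm_fit = unname(coef(lm(y ~ X - 1))))
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
all.equal(H, t(H))                             # symmetric
all.equal(H %*% H, H)                          # idempotent
sum(diag(H))                                   # trace equals p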
Under our assumption, least square estimates have the following property. \[\begin{eqnarray*} E(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})&=&E((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y)\\ &=&E((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon))\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})&=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Y)\\ &=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon))\\ &=&Var((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\epsilon)\\ &=&(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}Var(\epsilon)((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T})^T\\ &=&(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\sigma^2I((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T})^T\\ &=&\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ &=&\sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1}\\ \end{eqnarray*}\]
Going back to the general estimating equation, \[\begin{eqnarray*} \hat{\hspace{0pt}\boldsymbol{\beta}}=(\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^TY)&=&(\mathbf{W}^T\mathbf{X})^{-1}(\mathbf{W}^T\mathbf{X}\boldsymbol{\beta}_{true}+\epsilon)\\ &=&\boldsymbol{\beta}_{true}+(\mathbf{W}^T\mathbf{X})^{-1}\mathbf{W}^T\epsilon \end{eqnarray*}\] \[\begin{eqnarray*} E(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&E(\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&E(\boldsymbol{\beta}_{true})+E((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T E(\epsilon)\\ &=&\boldsymbol{\beta}_{true}\\ \end{eqnarray*}\] \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&Var(\boldsymbol{\beta}_{true}+(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&Var((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\epsilon)\\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T Var(\epsilon)((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T \sigma^2I((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&\sigma^2(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T ((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ \end{eqnarray*}\]
If we denote \((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T\) as \(A\) then \[\begin{eqnarray*} Var(\hat{\hspace{0pt}\boldsymbol{\beta}})&=&\sigma^2 AA^T\\ \end{eqnarray*}\]
We will use the following inequality, which holds because the hat matrix is a symmetric projection, \[\begin{eqnarray*} \lVert Y\rVert^2 \geq \lVert \hat{\hspace{0pt}Y}\rVert^2 \Rightarrow Y^TY \geq (HY)^T(HY)=Y^THY \end{eqnarray*}\] which is true for all \(Y\).
Hence \[\begin{eqnarray*} AA^T&\geq&AH^THA^T \\ &=&(\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T ((\mathbf{W}^T \mathbf{X})^{-1}\mathbf{W}^T)^T\\ &=&(\mathbf{X}^T \mathbf{X})^{-1} \end{eqnarray*}\]
Therefore we have \(Var(\hat{\hspace{0pt}\boldsymbol{\beta}})\geq \sigma^2(\mathbf{X}^T \mathbf{X})^{-1}=Var(\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS})\), telling us once again that least squares achieves the minimum variance among linear unbiased estimators of this form, proving the BLUE property of the least squares estimate.
Up to here, we assumed we knew that the true model is a linear combination of \(\mathbf{X}\), but we know that is not true in general. However, that still does not invalidate the use of the least squares estimate; as Box's quote has it, "All models are wrong, but some are useful." So, let us look at how our model will behave under more general assumptions.
We will assume \(Y=f(\mathbf{X})+\epsilon\) where \(f\) is some unknown function, and \(E[\epsilon]=0\) and \(Var[\epsilon]=\sigma^2I\). Also we will assume \(\mathbf{X}_1,\mathbf{X}_2,\cdots,\mathbf{X}_p\) are orthonormal. We will not assume a linear model for \(f\), but we will try to estimate \(Y\) using the method of least squares. Thus \(\hat{\hspace{0pt}Y}=H(f+\epsilon)=\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon}\), where \(\hat{\hspace{0pt}f}=\mathbf{X}\boldsymbol{\beta}_{best}\) and \(\boldsymbol{\beta}_{best}\) is the best we can do knowing our true model \(f\) is not linear. Also \(\hat{\hspace{0pt}\epsilon}=\mathbf{X}\delta\) where \(\delta=[\delta_1,\delta_2,\cdots,\delta_p]^T\) with \(E[\delta_i]=0\) and \(Var(\delta_i)=\sigma^2\); hence \(E(\hat{\hspace{0pt}\epsilon})=0\) and \(E\left[\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right]=E\left[ \lVert \mathbf{X}_1\rVert^2\delta_1^2+\cdots+\lVert\mathbf{X}_p\rVert^2\delta_p^2\right]=p\sigma^2\), which follows since \(\lVert \mathbf{X}_i\rVert^2=1\) from the orthonormality assumption we made earlier.
| obs | predictor 1 | predictor 2 | \(\cdots\) | predictor p | \(f\) | randomness in training | training data | randomness in testing | testing data |
|---|---|---|---|---|---|---|---|---|---|
| 1 | \(x_{11}\) | \(x_{12}\) | \(\cdots\) | \(x_{1p}\) | \(f_1\) | \(\epsilon_1\) | \(y_1=f_1+\epsilon_1\) | \(\epsilon_{1,new}\) | \(y_{1,new}=f_1+\epsilon_{1,new}\) |
| 2 | \(x_{21}\) | \(x_{22}\) | \(\cdots\) | \(x_{2p}\) | \(f_2\) | \(\epsilon_2\) | \(y_2=f_2+\epsilon_2\) | \(\epsilon_{2,new}\) | \(y_{2,new}=f_2+\epsilon_{2,new}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| n | \(x_{n1}\) | \(x_{n2}\) | \(\cdots\) | \(x_{np}\) | \(f_n\) | \(\epsilon_n\) | \(y_n=f_n+\epsilon_n\) | \(\epsilon_{n,new}\) | \(y_{n,new}=f_n+\epsilon_{n,new}\) |
|  | \(\mathbf{X}_{1}\) | \(\mathbf{X}_{2}\) | \(\cdots\) | \(\mathbf{X}_{p}\) | \(f\) | \(\epsilon\) | \(Y=f+\epsilon\) | \(\epsilon_{new}\) | \(Y_{new}=f+\epsilon_{new}\) |
Under this setting, let's look at the training and testing errors and their behavior.
Training Error \[\begin{eqnarray*} E\left[\lVert Y-\hat{\hspace{0pt}Y}\rVert^2\right]&=&E\left[\lVert (f+\epsilon)-(\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+(\epsilon-\hat{\hspace{0pt}\epsilon})\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\epsilon-\hat{\hspace{0pt}\epsilon})\rVert^2-2\langle(f-\hat{\hspace{0pt}f}),\ (\epsilon-\hat{\hspace{0pt}\epsilon})\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+E\left[\lVert\epsilon\rVert^2-\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2-p\sigma^2\\ \end{eqnarray*}\]
Which follows from the fact that \(\hat{\hspace{0pt}\epsilon}\) is the projection of \(\epsilon\),
\[\begin{eqnarray*} E\left[\lVert\epsilon-\hat{\hspace{0pt}\epsilon}\rVert^2\right]&=&E\left[\lVert\epsilon\rVert^2\right]+E\left[\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\right] -2E\left[\langle \epsilon\ \hat{\hspace{0pt}\epsilon}\rangle\right]\\ &=&n\sigma^2+p\sigma^2-2p\sigma^2\\ &=&n\sigma^2-p\sigma^2\\ \end{eqnarray*}\] Thus we get the Pythagoras Theorem \(\lVert\epsilon-\hat{\hspace{0pt}\epsilon}\rVert^2=\lVert \epsilon\rVert^2-\lVert\hat{\hspace{0pt}\epsilon}\rVert^2\).
But for the testing error this is different, since \(\hat{\hspace{0pt}\epsilon}\) is not a projection of \(\epsilon_{new}\):
Estimation Error \[\begin{eqnarray*} E\left[\lVert \hat{\hspace{0pt}Y}-f\rVert^2\right]&=&E\left[\lVert (\hat{\hspace{0pt}f}+\hat{\hspace{0pt}\epsilon})-f\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+\hat{\hspace{0pt}\epsilon}\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\hat{\hspace{0pt}\epsilon})\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ \hat{\hspace{0pt}\epsilon}\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+p\sigma^2\\ &=&\verb|estimation error| = \verb|training error| - n\sigma^2 + 2p\sigma^2 \end{eqnarray*}\]
Error of \(\boldsymbol{\beta}_{best}\) \[\begin{eqnarray*} E\left[\lVert Y-\hat{\hspace{0pt}f}\rVert^2\right]&=&E\left[\lVert (f+\epsilon)-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})+\epsilon\rVert^2\right]\\ &=&E\left[\lVert (f-\hat{\hspace{0pt}f})\rVert^2+\lVert(\epsilon)\rVert^2-2\langle(f-\hat{\hspace{0pt}f})\ \epsilon\rangle\right]\\ &=&\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2\\ \end{eqnarray*}\] Also we see that error of \(\boldsymbol{\beta}_{best}\) = training error + \(p\sigma^2\).
| error | expected value |
|---|---|
| training error | \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2-p\sigma^2\) |
| estimation error | \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+p\sigma^2\) |
| error of \(\boldsymbol{\beta}_{best}\) | \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2\) |
| testing error | \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2+n\sigma^2+p\sigma^2\) |
Thus, we see that
| regime | ordering of errors |
|---|---|
| when \(n>2p\) | estimation error \(\leq\) training error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error |
| when \(2p>n>p\) | training error \(\leq\) estimation error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error |
| when \(n=p\) | training error \(\leq\) estimation error \(=\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) testing error |
| when \(n<p\) | training error \(\leq\) error of \(\boldsymbol{\beta}_{best}\) \(\leq\) estimation error \(\leq\) testing error |
This shows that the relative size of \(n\) to \(p\) affects where the estimation error falls relative to the other errors, while the relative ordering of the training error, the error of \(\boldsymbol{\beta}_{best}\), and the testing error is stable. If we increase \(p\), we will have less training error because of the \(-p\sigma^2\) term, but the testing error increases through the \(+p\sigma^2\) term.
When you plot the errors against \(p\), you will see that the training error always goes down as \(p\) increases, but the testing error goes down up to some point and then increases. This is because when you increase \(p\), not only does the \(-p\sigma^2\) term decrease the training error, but the increase in \(p\) also reduces the model bias term \(\lVert (f-\hat{\hspace{0pt}f})\rVert^2\). The reduction in model bias is responsible for the initial decrease in testing error. However, at some point, the decrease in model bias is overwhelmed by the \(+p\sigma^2\) term, which is why the testing error eventually increases.
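A small simulation illustrates this (a sketch with an arbitrary nonlinear \(f\) and Gaussian predictors; the particular constants are not meaningful): the training MSE keeps falling as more predictors are added, while the testing MSE, computed with fresh noise on the same \(f\) as in the fixed-\(\mathbf{X}\) setting above, eventually turns back up.
set.seed(2)
n <- 100
X <- matrix(rnorm(n * 50), n)              # a pool of 50 candidate predictors
f <- 2 * sin(X[, 1]) + X[, 2]^2            # true signal, an arbitrary nonlinear choice
y_train <- f + rnorm(n)                    # training noise
y_test  <- f + rnorm(n)                    # new noise on the same f
errs <- t(sapply(1:50, function(p) {
  yhat <- fitted(lm(y_train ~ X[, 1:p, drop = FALSE]))
  c(p = p, train = mean((y_train - yhat)^2), test = mean((y_test - yhat)^2))
}))
matplot(errs[, "p"], errs[, c("train", "test")], type = "l", lty = 1,
        col = c("skyblue", "red"), xlab = "number of predictors p", ylab = "MSE")
legend("topleft", legend = c("training", "testing"), col = c("skyblue", "red"), lty = 1)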
In the extreme case where \(Y\) is pure noise, that is \(f=0\), and \(n=p\), we have \(\hat{\hspace{0pt}\epsilon}=\epsilon\). Therefore, your training error will be 0, but when you test your model, the error involves \(\epsilon_{new}-\hat{\hspace{0pt}\epsilon}=\epsilon_{new}-\epsilon\), which is more error in general. Generally speaking, you don't want to be too clever and interpret too many patterns in your training error, or else you will be penalized on your test data.
For general estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\), let \(\hat{\hspace{0pt}f}=\mathbf{X}\boldsymbol{\beta}_{best}\), and let \(\mu=E[\tilde{\hspace{0pt}\boldsymbol{\beta}}]\). Then estimation error can be decomposed into model bias, estimation bias, and estimation variance.
\[\begin{eqnarray*} E\lVert f-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2&=&E\lVert f-\hat{\hspace{0pt}f}+\mathbf{X}\boldsymbol{\beta}_{best}-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\\ &=&\lVert f-\hat{\hspace{0pt}f}\rVert^2+E\lVert \mathbf{X}\boldsymbol{\beta}_{best}-\mathbf{X}\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\\ &=&\lVert f-\hat{\hspace{0pt}f}\rVert^2+\lVert \boldsymbol{\beta}_{best}-\mu \rVert^2 +E\lVert \mu - \tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2 \\ \end{eqnarray*}\]If the \(\tilde{\hspace{0pt}\boldsymbol{\beta}}=\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}\), that is \(E[\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}]=\boldsymbol{\beta}_{best}\), then \(\lVert \boldsymbol{\beta}_{best}-\mu \rVert^2=0\).
Model bias and estimation bias can be very ambiguous. Suppose there are 20,000 predictors (\(p=20000\)) and \(n\) is not quite as large as \(p\). To fit a model in such a situation, there are two ways to go about it, both of which will be discussed later under the topic of regularization. One is model selection, where you keep the most significant predictors in the model and exclude the rest. Another is shrinkage estimation, where you shrink most of the estimates toward zero. Essentially, you are doing the same thing; however, model selection will increase the model bias, whereas the shrinkage method will increase the estimation bias.
| obs. | predictor 1 | predictor 2 | \(\cdots\) | predictor p | \(f\) | randomness in training | training data | randomness in testing | testing data |
|---|---|---|---|---|---|---|---|---|---|
| 1 | \(x_{11}\) | \(x_{12}\) | \(\cdots\) | \(x_{1p}\) | \(f_1\) | \(\epsilon_1\) | \(y_1=f_1+\epsilon_1\) | \(\epsilon_{1,new}\) | \(y_{1,new}=f_1+\epsilon_{1,new}\) |
| 2 | \(x_{21}\) | \(x_{22}\) | \(\cdots\) | \(x_{2p}\) | \(f_2\) | \(\epsilon_2\) | \(y_2=f_2+\epsilon_2\) | \(\epsilon_{2,new}\) | \(y_{2,new}=f_2+\epsilon_{2,new}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| n | \(x_{n1}\) | \(x_{n2}\) | \(\cdots\) | \(x_{np}\) | \(f_n\) | \(\epsilon_n\) | \(y_n=f_n+\epsilon_n\) | \(\epsilon_{n,new}\) | \(y_{n,new}=f_n+\epsilon_{n,new}\) |
|  | \(\mathbf{X}_{1}\) | \(\mathbf{X}_{2}\) | \(\cdots\) | \(\mathbf{X}_{p}\) | \(f\) | \(\epsilon\) | \(Y=f+\epsilon\) | \(\epsilon_{new}\) | \(Y_{new}=f+\epsilon_{new}\) |
For general estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}=[\tilde{\hspace{0pt}\boldsymbol{\beta}}_1,\cdots,\tilde{\hspace{0pt}\boldsymbol{\beta}}_p]^T\), let \(\hat{\hspace{0pt}f}=X\tilde{\hspace{0pt}\boldsymbol{\beta}}\), and let \(\mu=E[\tilde{\hspace{0pt}\boldsymbol{\beta}}]=[\mu_{\tilde{\hspace{0pt}\boldsymbol{\beta}_1}},\cdots,\mu_{\tilde{\hspace{0pt}\boldsymbol{\beta}_p}}]^T\). Then, estimation error, training error, and testing error can be written as:
Estimation Error \[\begin{eqnarray*} E\left[\lVert f-\hat{\hspace{0pt}f}\rVert^2\right] &=& E\left[\lVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right] \end{eqnarray*}\]
Training Error \[\begin{eqnarray*} E\left[\lVert Y-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]&=&E\left[\lVert f+\epsilon-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\lVert f-\hat{\hspace{0pt}f}+\epsilon\rVert^2\right]\\ &=&E\left[\lVert f-\hat{\hspace{0pt}f}\rVert^2+\lVert \epsilon\rVert^2+2\langle f-\hat{\hspace{0pt}f},\ \epsilon \rangle\right]\\ &=&E\left[\lVert f-\hat{\hspace{0pt}f}\rVert^2\right]+E\left[\lVert \epsilon\rVert^2\right]+2E\left[\langle f-\hat{\hspace{0pt}f},\ \epsilon \rangle\right]\\ &=&E\left[\lVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]+n\sigma^2-2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ &=&\verb|estimation error|+n\sigma^2-2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ \end{eqnarray*}\]
Testing Error \[\begin{eqnarray*} E\left[\lVert Y_{new}-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]&=&E\left[\lVert f+\epsilon_{new}-\hat{\hspace{0pt}f}\rVert^2\right]\\ &=&E\left[\lVert f-\hat{\hspace{0pt}f}+\epsilon_{new}\rVert^2\right]\\ &=&E\left[\lVert f-\hat{\hspace{0pt}f}\rVert^2+\lVert \epsilon_{new}\rVert^2\right]\\ &=&E\left[\lVert f-X\tilde{\hspace{0pt}\boldsymbol{\beta}}\rVert^2\right]+n\sigma^2\\ &=&\verb|estimation error|+n\sigma^2\\ &=&\verb|training error|+2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\\ \end{eqnarray*}\]
Let’s look closely at the following relationships.
You can see that the training error is decreased by the \(2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term, but the testing error is increased by the same \(2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term. Generally speaking, the \(E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\) term is like a correlation between your estimated signal \(\hat{\hspace{0pt}f}\) and the noise \(\epsilon\), so it can be interpreted as how much noise your estimator \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\) absorbs. Thus, when you fit a model, you want to control this term in order to avoid over-fitting.
For the least squares estimate, testing error \(=\) training error \(+2p\sigma^2\). If we let \(2p\sigma^2=2E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle\), we get \(p=\frac{E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle}{\sigma^2}\); hence we see that the effective degrees of freedom of \(\tilde{\hspace{0pt}\boldsymbol{\beta}}\) is \(\frac{E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle}{\sigma^2}\).
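For the least squares fit, the identity \(E\langle \hat{\hspace{0pt}f},\ \epsilon \rangle = p\sigma^2\) can be verified with a short trace calculation (a sketch, using \(\hat{\hspace{0pt}f}=\mathbf{X}\hat{\hspace{0pt}\boldsymbol{\beta}}_{LS}=HY=H(f+\epsilon)\) and \(E[\epsilon\epsilon^T]=\sigma^2I\)): \[\begin{eqnarray*} E\langle H(f+\epsilon),\ \epsilon \rangle &=& f^TH^TE[\epsilon]+E\left[\epsilon^TH\epsilon\right]\\ &=&0+\mathrm{tr}\left(HE[\epsilon\epsilon^T]\right)\\ &=&\sigma^2\mathrm{tr}(H)\\ &=&p\sigma^2, \end{eqnarray*}\] since \(\mathrm{tr}(H)=\mathrm{tr}\left(\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\right)=\mathrm{tr}\left((\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}\right)=p\).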